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Abstract. Materialized view selection is a non-trivial task. Hence, its 
complexity must be reduced. A judicious choice of views must be cost- 
driven and influenced by the workload experienced by the system. In 
this paper, we propose a framework for materialized view selection that 
exploits a data mining technique (clustering), in order to determine clus- 
ters of similar queries. We also propose a view merging algorithm that 
builds a set of candidate views, as well as a greedy process for selecting 
a set of views to materialize. This selection is based on cost models that 
evaluate the cost of accessing data using views and the cost of storing 
these views. To validate our strategy, we executed a workload of decision- 
support queries on a test data warehouse, with and without using our 
strategy. Our experimental results demonstrate its efficiency, even when 
storage space is limited. 



1 Introduction 

Among the techniques adopted in relational implementations of data warehouses 
to improve query performance, view materialization and indexing are presumably 
the most effective ones [16]. Materialized views are physical structures that im- 
prove data access time by precomputing intermediary results. Then, user queries 
can be efficiently processed by using data stored within views and do not need 
to access the original data. Nevertheless, the use of materialized views requires 
additional storage space and entails maintenance overhead when refreshing the 
data warehouse. 

One of the most important issues in data warehouse physical design is to 
select an appropriate set of materialized views, called a configuration of views, 
which minimizes total query response time and the cost of maintaining the se- 
lected views, given a limited storage space. To achieve this goal, views that are 
closely related to the workload queries must be materialized. 

The view selection problem has received significant attention in the literature. 
Researches about it differ in several points: (1) the way of determining candidate 
views; (2) the frameworks used to capture relationships between candidate views; 
(3) the use of mathematical cost models vs. calls to the query optimizer; (4) view 



selection in the relational or multidimensional context; (5) multiple or simple 
query optimization; and (6) theoretical or technical solutions. 

The classical papers in materialized view selection introduce a lattice frame- 
work that models and captures dependency (ancestor or descendent) among 
aggregate views in a multidimensional context [2,6,11,14,22]. This lattice is 
greedily browsed with the help of cost models to select the best views to ma- 
terialize. This problem has been firstly addressed in one data cube and then 
extended to multiple cubes [17]. Another theoretical framework called the AND- 
OR view graph may also be used to capture the relationships between views [9, 
5,10,15,23]. The majority of these solutions are theoretical and are not truly 
scalable. In opposition to these studies, we exploit a query clustering involving 
similarity and dissimilarity measures defined on the workload queries, in order 
to capture the relationships existing between the candidate views derived from 
this workload. This approach is scalable thanks to the low complexity of our 
clustering (log linear regarding the number of queries and linear regarding the 
number of attributes). 

A wavelet framework for adaptively representing multidimensional data cubes 
has also been proposed [19]. This method decomposes data cubes into an indexed 
hierarchy of wavelet view elements that correspond to partial and residual ag- 
gregations of data cubes. An algorithm greedily selects a non-expensive set of 
wavelet view elements that minimizes the average processing cost of the queries 
defined on the data cubes. In the same spirit, Kotidis et al. proposed the Dwarf 
structure, which compresses data cubes [18]. Dwarf identifies prefix and suffix 
redundancies within cube cells and factors them out by coalescing their stor- 
age. Suppressing redundancy improves the maintenance and interrogation costs 
of data cubes. These approaches are very interesting, but they are mainly fo- 
cused on computing efficient data cubes by changing their physical design. In 
opposition, we aim at optimizing performance in relational warehouses without 
modifying their design. 

Other approaches detect common sub-expressions within workload queries 
in the relational context [3,7, 16,20]. The problem of view selection consists in 
finding common subexpressions corresponding to intermediary results that are 
suitable to materialize. However, browsing is very costly and these methods are 
not truly scalable with respect to the number of queries. 

Finally, the most recent approaches arc workload-driven. They syntactically 
analyze the workload to enumerate relevant candidate views [1]. By calling the 
query optimizer, they greedily build a configuration of the most pertinent views. 
A workload is indeed a good starting point to predict future queries because these 
queries are probably within or syntactically close to a previous query workload. 
In addition, extracting candidate views from the workload ensures that future 
materialized views will probably be used when processing queries. 

Our approach is also workload-driven. Its originality lies in exploiting knowl- 
edge about how views can be used to resolve a set of queries to cluster these 
queries together. For this purpose, we define the notion of query similarity 
and dissimilarity in order to capture closely related queries. These queries are 



grouped in the same cluster and are used to build a set of candidate views. 
Furthermore, these candidate views are merged to resolve multiple queries. This 
merging process can be seen as iteratively building a lattice of views. The merg- 
ing process time can be expensive when the number of candidate views is high. 
However, we apply merging over candidate views present in each cluster instead 
of the whole set of candidate views as in [1] . This reduces the complexity of the 
merging process, since the number of candidate views per cluster is significantly 
lower. 

The remainder of this paper is organized as follows. We first present in Sec- 
tion 2 our materialized view selection strategy. Then, we show in Section 3 how 
we build a candidate view configuration through our merging process. Next, we 
detail in Section 4 the cost models used for building the final configuration of 
views to materialize. To validate our approach, we also present some experiments 
in Section 7. We finally conclude and provide research perspectives in Section 8. 

2 Strategy for materialized view selection 

The architecture of our materialized view selection strategy is depicted in Fig- 
ure 1. We assume that we have a workload composed of representative queries 
for which wc want to select a configuration of materialized views in order to 
reduce their execution time. The first step is to build, from the workload, a con- 
text for clustering. This context is modelled as a matrix having as many lines 
as the extracted queries and as many columns as the extracted attributes from 
the whole set of queries. Wc define similarity and dissimilarity measures that 
help clustering together relatively similar queries. We apply a merging process 
on each query cluster to build a configuration of candidate views. Then, the final 
view configuration is created with a greedy algorithm. This step exploits cost 
models that evaluate the cost of accessing data using views and the cost of their 
storage. 

2.1 Query workload analysis 

The workloads we consider are sets of GPSJ (Generalized Projection-Selection- 
Join) queries. A GPSJ query q is composed of joins, selection predicates and 
aggregations. As such, it may be expressed in relational algebra over a star 
schema as follows: q = ttg,m<ts(F 1X1 D\ X D 2 X . . . x Dj), where S is a 
conjunction of simple range predicates on dimension table attributes, G is a 
set of attributes from dimension tables Di (grouping set), and M is a set of 
aggregated measures each defined by applying aggregation operator to a measure 
in fact table F. For example, query qi in Figure 2 may be expressed as follows: 

9l = 71 ' sales.time_id,sum(quantity_sold) a ' fiscal -day=2(sdleS X times). 

The first step consists in extracting from the workload the attributes that 
are representative of each query. We mean by representative attributes those 
that are present in Where (join and selection predicate attributes) and Group 
by clauses. We also save for each query their aggregation operators and joined 
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Fig. 1. Strategy of materialized view selection 



tables. A query qi is then seen as a line in a matrix composed of cells that 
correspond to the representative attributes. The general term gy of this matrix 
is set to 1 if the extracted attribute is present in the query and to otherwise. 
This matrix represents our clustering context. Moreover, we store in an appendix 
matrix the existing associations between the join attributes and queries, in the 
same manner. We illustrate this step by an example: from the workload shown 
in Figure 2, we build the clustering context depicted in Figure 3. 

2.2 Building the candidate view set 

In practice, it is hard to search all the views that are syntactically relevant 
(candidate views) from the workload queries, because the search space is very 
large [1]. To reduce the size of this space, we propose to cluster the queries. In- 
deed, we group in a same cluster all the queries that are closely similar. Closely 
similar queries are queries having a close binary representation in the query- 
attribute matrix. Two closely similar queries can be resolved by using only one 
materialized view. Used within a clustering process, the similarity and dissimi- 
larity measures defined in the next section ensures that queries within the same 
cluster strongly relate to each other whereas queries from different clusters are 
significantly distant to each other. 

Similarity measure. Let QA be a query-attribute matrix that has a set of 
queries Q = {qi,i = l..n} as rows and a set of attributes A = {aj,j = 



(q±) select sales. time Jd, sum(quantity_sold) from sales, times ^^^^^^^^^^ = ^^^^^^^^^^ 
where sales. timejd — times. timc_id and times. fiscal_day — 2 
group by sales. time_id; 

(92) select sales. prod_id, sum(amount_sold) from sales, products, promotions 

where sales. prod_id — products. prod_id and sales. promo_id — promotions. promo_id and 
promotions. promo_catcgory — 'newspaper' 
group by sales. prod_id; 

(93) select sales. cust_id, sum(amount_sold) from sales, customers, products, times 

where sales. cust_id — customers, cust Jd and sales. prod_id — products, prod Jd and 
sales. timejd — times. timejd and times. fiscal-day — 3 and customers. cust_marital_status 
— 'single' and products. prod_category —'Women' 
group by sales. cust Jd; 



Fig. 2. Example of workload 
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Fig. 3. Example of clustering context 



as columns. The value q^ is equal to 1 if attribute a 3 - is extracted from query 
qi. Otherwise, is equal to 0. We describe query by a vector of p values 
Qi = [in, ■■■,Qip\- These p values describe respectively the presence (q^ = 1) or 
absence (q^ = 0) of attribute aj. This description model helps comparing two 
queries. Then, for example, we can consider queries q\ and qi as closely similar 
if vectors [qu, qi v ) and [(721, ?2p] have the majority of their cells equal. This 
introduces the notion similarity and dissimilarity between queries. 



Similarity and dissimilarity between queries. We define the notion of 
similarity and dissimilarity between queries by two functions 5 S imj{<ik,<li) and 
fidissirrij (qk,qi) that measure the similarity between two queries q k and qi with 
respect to attribute aj. 



:(qk,qi) 



1 if q kj = q tj = 1 
otherwise 



This first function defines the notion of similarity between q k and qi following 
attribute aj: two queries qk and qi are considered similar regarding attribute aj 
if and only if q k j = qij = 1, i.e., attribute aj is extracted from both queries. 

if, 



6V (ak a,)- I Ult »j=% 



This second function defines the notion of dissimilarity between queries q k 
and qi according to attribute aj : two queries q k and qi are considered dissimilar 
according to attribute aj if only and if q k j ^ qij, i.e., if one and only one of the 
queries does not contain aj . Note that there is not a complete symmetry between 



the notion of similarity and dissimilarity: non similar queries according to an at- 
tribute are not necessarily dissimilar according to this attribute. For example, let 
qi and qi be queries such that qij = and q 2 j = 0, respectively. Then we have 
S S i mj (qi,q2) = (qi and q 2 are not considered similar) and 6 dissim:j (q 1 ,q2) = 
(qi and qi are not considered dissimilar). This absence of full symmetry under- 
lines the fact that the absence of the same attribute in two queries does not give 
an element of similarity or dissimilarity between these queries. 

These measures can be extended to an attribute set A = {a\, . . . , a p } such 
that wc get the degree of global similarity and dissimilarity between two queries: 
sim(q k ,qi) = Ylj=i ^si mj {qk, qi) and dissim(q k , qi) = Y^=i ^dissimjiQk, qi), where 
< sim(qk, qi) < p and < dissim(qk, qi) < p. Hence, the closer sim(q a ,qt,) 
(resp. dissim(q a , qb)) is to p the more q a and qb can be considered globally 
similar (resp. dissimilar). 



Similarity and dissimilarity between query sets. As we do for two queries, 
we introduce two functions that take into account the degree of similarity and 
dissimilarity between two query sets. A set of queries (subset of Q) is denoted C a . 
In order to translate the level of similarity (resp. dissimilarity) between query 
sets, we use function Sim(C ai Cb) (resp. Dissim(Ci, Cj)) that determines the 
number of similarities (resp. dissimilarities) between two different sets of queries 
C a and C b (C a + Cb): 

Sim(C a ,C b )= sim(q k ,qi) 
qk£C a ,qiec b 

Dissim(C a ,Cb) — ^ dissim(qk,qi) 
qk£C a ,qiec b 

where < Sim(C ai Cb) < card(C a ) x card{Cb) x p and < Dissim(C a ,Cb) < 
card(C a ) x card(Cb) x p. Hence, the closer Sim(C a , C b ) (resp. Dissim(C ai Cb)) 
is to card(C a ) x card(C b ) x p the more C a and Cb can be considered similar 
(resp. dissimilar). 



Similarity and dissimilarity within a query set. The notion of simi- 
larity (resp. dissimilarity) within a query set corresponds to the number of 
similarities (resp. dissimilarities) between queries of a same set C a . It con- 
sists of an extension of the similarity and dissimilarity functions defined be- 
tween query sets: Sim(C a ) — J2 qk ec a q t ec a k<l s ^ m (lk, qi) and Dissim{C a ) = 
T, qk ec a , qi ec a ,k<i dissim (lk,qi), where < Sim(C a ) < c °^(c a )x(carri(c a )-i)x P 
and < Dissim(C a ) < carrf ( c, ") x ( c ° rd ( c °)~ 1 ) >< P . Hence, the close Sim(C a ) (resp. 

DisSim{C a )) is tO ™rd(C a )x(card(C a )-l)xp ^ more ^ CQntains querics that &re 

globally similar (resp. dissimilar). 



Query clustering. Clustering involves the determination of groups of objects 
(here: queries) that reveal the the internal structure of data. These groups must 



be such as they are composed of objects with high similarity and objects from 
different clusters present a high dissimilarity. 

Let us consider clustering Ph of a query set, a quality measure of this clus- 
tering can be built as follows: 



Q(Ph) = 2^ Sim(C a ,C b ) + 2 t Dissim(C a ) 

a=l..z, »=i 
b— l..z,a < b 

0<Q(P„) <£-, ._, . carrf(C,) xcardiC,) Xp + V' c.rd( C< )x(c n rd( C< )-i)xp 

This measure permits to capture the natural aspect of a clustering. Hence, 
Q(Ph) measures simultaneously similarities between queries within the same 
cluster and dissimilarities between queries within different clusters. Thus, Q(Ph) 
evaluates simultaneously the homogeneity of clusters as well as the heterogeneity 
between clusters. Therefore, the clustering presenting a high intra-cluster homo- 
geneity and a high inter-cluster disparity has a weak value of Q{Ph) and thereby 
appears as the most natural. 

Jouve and Nicoloyannis proposed such a solution in the Kerouac clustering 
algorithm and its associated clustering quality measure [12]. We have chosen this 
algorithm because it has several interesting properties: (1) its computational 
complexity is relatively low (log linear regarding the number of queries and 
linear regarding the number of attributes) ; (2) it can deal with a high number 
of objects (queries) ; (3) it can deal with distributed data [13]. 



3 View merging process 



If we materialize all the different views derived from the query clusters obtained 
in the previous step, we can obtain a high number of materialized views, espe- 
cially if the number of queries within the workload is high. A view configuration 
obtained this way would not be very relevant if the storage space allotted by the 
data warehouse administrator was limited. Instead of materializing each view, it 
is better to only materialize views that can be used to resolve multiple queries. 
To solve this problem, we must enumerate the space of views that can be merged, 
determine how to guide the merging process, and finally build the set of merged 
views. View merging is relevant if the queries are strongly similar. As we cluster 
together closely similar queries, it is logical to apply the merging process on the 
set of queries present in each cluster. This significantly reduces the number of 
possible combinations when merging views. We detail in the following sections 
how we merge two views and then generalize this process to many views. 

Merging of view couples. The merging of two views must ensure these condi- 
tions: (1) all queries resolved by each view must also be resolved by the merged 
view, and (2) the cost of using the couple of views must not be significatively 
greater than the cost obtained when using the merged view. Let v\ and v 2 be a 



couple of views of the same cluster and sn, . . . , s\ m the selection predicates that 
are in V\ and not in v 2 . In a dual way, let s 2 i, ■ • ■ , s\ n be the selection conditions 
present in v 2 and not in v\. Merged view v\ 2 is obtained by applying Algorithm 1. 



Algorithm 1 Merge- View_Pair(v\ . vi ) 
1: put vi and V2 aggregation operations in V12 

operation aggregations 
2: put the union of projection and group by 

attributes vi and V2 in projection and group 

by clause of V12 
3: put all attributes sn,...,si m and 

S21, . . . , si n in the group by clause of V12 
4: put the selection predicates shared between 

vi and V2 in the selection predicate clause 

of V12 



Algorithm 2 Mergin_View_Generation 

1: M = Vi 

2: for (fc = 2; V fe _i ^ 0; fe + +) do 

3: Cfc = View_Gen(Vfc_i) 

4: M <- M U C k 

5: for all (view v G M) do 

6: Remove the parents of v from M 

7: end for 

8: end for 

9: return M 



The merging of two views v\ and v 2 is effective if cost(v 12 ) > ((cost(v\) + 
cost(v 2 )) * x). Cost computation is detailed in Section 4. The value of x is fixed 
empirically by the administrator. If it is small (resp. high), we privilege (resp. 
disadvantage) view merging. 

Property 1 The view obtained by merging views v\ and v 2 is the smallest view 
that resolves the query resolved by both v\ and v 2 . 

Proof. To show that the view obtained by merging views vi and V2 is the smallest 
view, we have to show that there is no view v' 12 such as the data within v' 12 are also 
included within vi2- We denote respectively views vi, v 2 and V12 ttg 1: m 1 (Ts 1 (Fx.. .), 
7rG2,M 2 o"s 2 (F [xi . . .) and ttgi2,m 12 o~s 1 2(F c>< . . .), respectively, where: 

- G\, G 2 are respectively the attribute set of the group by clause of views vi and v 2 ; 

- Si , S2 are respectively the attribute set of the selection predicates of v\ and V2 ; 

- G\2 — G\ U G2 U (5*1 U S2 — Si n S2 ) is the attribute set of the group by clause of 
merged view V12; 

- S12 — Si n S2 is the set of attribute selection predicates within merged view vi2- 

Note that sets G12 and S12 are obtained by applying lines 1 and 2 of Algorithm 1. 
Let us now assume that the data in view v'i 2 , denoted itg' 12 ,m' &s' 12 (F 1X1 • ■ ■) are au 
in vi2- This means that both of the following conditions hold: (1) G12 C G' 12 , (2) 

5*12 D S'i 2 - 

From the first condition, there is at least one attribute x such that x 6 G' 12 and 
x £ Gi2- As we have x G12, then x ^ Gi, x £ G 2 and x Si U (S2 — Si 0^2) because 
G12 — Gi U G 2 U (Si U S2 — Si n S2). As x £ Gi and x G2, then x is not in any clause 
of V2- This means that x ^ C7' 12 , which contradicts condition x G G'12. 

From the second condition, there is at least one attribute y such that y G S12 and 
y £ S'i2- As we have y € S12, then y £ Si and y G S2 because S12 = Si U S2. As y G Si 
and y £ S2, then y must be in all the predicates of the views obtained by merging vi 
and V2- This means that y G «Si 2 , which contradicts condition y £ S'i 2 - 

Merged view generation algorithm. The algorithm of view generation by 
merging is similar to algorithms searching for frequent itemsets. A frequent item- 
set lattice looks like a lattice of views within a given cluster. The lattice nodes 
represent the space of views obtained by merging. 



Algorithm 3 Function View-Gen(Vk — i) 

1: c fc = 

2: for all (view v £ Vfc_i) do 

3: for all (view u G Vfe_i) do 

4: if (v[l] = u[l] A ... A v[k — 2] — u[k — 

2] A v[k - 1] < u[fc - 1]) then 
5: c — Merge _View_P air 

6: if (cost(c) > ((cost(v)+cost(u))*x)) 

then 

7: C k = C k u {c} 

8: end if 

9: end if 

10: end for 

11: end for 

12: return Ck 



Algorithm 4 

View_Configuration_Construction 

1: S <- 
2: repeat 

3: twx <- 

4: F max «- 

5: for all Vj e V - S do 

6: if F /S {vj) > F ma x then 

7: fmoi <— F/s( v j) 

8: v max <— 

9: end if 

10: end for 

11: if F /S (tw x ) > then 

12: S^Su{v max } 

13: end if 

14: until (F /S (iw x ) < or V - S = 0) 



The algorithm of view generation by merging (Algorithm 3) uses an iterative 
approach by level to generate a new view. It explores the view lattice in breadth 
first. The input of the algorithm is V\, a set of candidate views extracted from 
a given cluster. This algorithm outputs a set of candidate views obtained by 
merging. In the k th iteration, view set 14- 1 obtained by merging the k — 1 th 
level's views from the lattice (computed in the last step) is used to generate the 
set Ck of fc-candidate views. This set is added to set M (line 4). The parents of 
each view obtained by merging are then removed from set M (lines 5 to 7) . 

The function for view generation by merging View_Gen(V4-i) , called on 
line 3, takes as argument Vfe-i and returns Ck- Two views v and u within Vk-i 
form a k-view c if and only if they have (fc — 2) views in common. This is 
expressed using a lexicographic order in the condition of line 3. We denote by 
v[l] . . . v[k— 2]v[k— 1] the merged views in the k th iteration that are used to derive 
V. Function Merge_View_Pair(u,u) (Algorithm 1) called on line 5 of View_Gen 
generates a new view c. The condition of line 6 ensures, after generating a k-view 
from two k — 1-views, that the candidate view does not have a cost greater than 
the cost of its parents. 



4 Cost models 



The number of candidate views is generally as high as the input workload is 
large. Thus, it is not feasible to materialize all the proposed views because of 
storage space constraints. To circumvent these limitations, we use cost models 
allowing to conserve only the most pertinent views. In most data warehouse cost 
models [8] , the cost of a query q is assumed to be proportional to the number of 
tuples in the view on which q is executed. In the following section, we detail the 
cost model that estimates the size of a given view. 

Let ms(F) be the maximum size of fact table F, \F\ be the number of tuples 
in F, Di_ID be a primary key of dimension Di, \DiJfD\ be the cardinality of 
the attribute(s) that form the primary key, and N be the number of dimension 
tables. Then, ms(F) = \D t JD\. 



Let ms(V) be the maximum size of a given view v that has attributes 
Oi, d2, ■ ■ ■ , <ik m hs group by clause, where k is the number of attributes in 
v and |aj| is the cardinality of attribute ai. Then, ms(v) — \ a i\- 

Golfarclli et al. [8] proposed to estimate the number of tuples in a given view 
v by using Yao's formula [24] as follows: 

\v\ = ms(v) x [l - n!-i ms{F iiT~iV ] , where d = 1 - -k^. If is sum- 

^ ' lli—l ms(F) — 2+1 1 msti?) msji?) 



ciently large, then Cardenas' formula [4] approximation gives: 



\v\=ms(v)x (l-(l- dw)' F ') 



, where = 1 • 

Cardenas' and Yao's formulaes are based on the assumption that data is 
uniformly distributed. Any skew in the data tends to reduce the number of tuples 
in the aggregate view. Hence, the uniform assumption tends to overestimate the 
size of the views and give a crude estimation. However, they have the advantage 
to be simple to implement and fast to compute. In addition, because of the 
modularity of our approach, it is easy to replace the cost model module by 
another more accurate one. 

From the number of tuples in v, we estimate its size, in bytes, as follows: 
size(v) = \v\ x YTi=i size(ci), where size(ci) denotes the size, in bytes, of column 
d of v, and c is the number of columns in v. 



5 Objective functions 

In this section, we describe three objective functions to evaluate the variation 
of query execution cost, in number of tuples to read, induced by adding a new 
view. The query execution cost is assimilated to the number of tuples in the fact 
table when no view is used or to the number of tuples in view(s) otherwise. The 
workload execution cost is obtained by adding all execution costs for each query 
within this workload. 

The first objective function advantages the views providing more profit while 
executing queries, the second one advantages the views providing more benefit 
and occupying the smallest storage space, and the third one combines the first 
two in order to select at first all the views providing more profit and then keep 
only those occupying the smallest storage space when this resource becomes 
critical. The first function is useful when storage space is not limited, the second 
one is useful when storage space is small and the third one is interesting when 
storage space is larger. 



5.1 Profit objective function 

Let V = {vi, ...,v m } be a candidate view set, Q = {qi,...,q n } a query set (a 
workload) and S a final view set to build. The profit objective function, noted 
P, is defined as follows: 

p /s( v j) = (C/ S {Q) - C /Su{v . } (Q) - 13 C mainten ance({vj})) , where Vj £ S. 



— C/ S (Q) denotes the query execution cost when all views in S are used. If 
this set is empty, C ' /${Q) = \Q\ x |F| because all the queries are resolved 
by accessing fact table When a view Vj is added to S, C/su{vj}(Q) = 

X^L=o C(qk, { v j}) denotes the query execution cost for the views that are in 
S U {vj}. If query qk exploits Vj, the cost C(qk,{vj}) is then equal to C Vj 
(number of tuples in Vj). Otherwise, C(qk, {vj}) is equal to the minimum 
value between |F| and values of C(qk, {v}) (executing cost of qk exploiting 
v E S with v ^ Vj). 

— Coefficient /? = \Q \ p(vj) estimates the number of updates for view Vj. The 
update probability p(«,-) is equal to r-V- ^update where %u P date 

^ L J r \ j i ~l number oj views /cquery 7 /oquery 

represents the proportion of updating vs. querying the data warehouse. 

— C m aintenance{{vj}) represents the maintenance cost for view Vj. 



5.2 Profit/space ratio objective function 

If view selection is achieved under a space constraint, the profit /space objective 
function R/ S (vj) — ) is used. This function computes the profit provided 

by Vj in regard to the storage space size(vj) that it occupies. 



5.3 Hybrid objective function 

The constraint on the storage space may be relaxed if this space is relatively 
large. The hybrid objective function H does not penalize space- "greedy" views 
if the ratio re T" mfl - space is lower or equal than a given threshold a (0 < a < 1), 

storagespace u ° v — ' 1 

where remaining space and storagespace are respectively the remaining space 
after adding Vj and the allotted space needed for storing all the views. This 
function is computed by combining the two functions P and R as follows: 

( P /S ( V A if fining .space 
H/a(ll ■] — < ' J storagespa.ee ' 

/s[ 3 > \R /S (vj) otherwise. 



6 View selection algorithm 

The view selection algorithm (Algorithm 4) is based on a greedy search within 
the candidate view set V. Objective function F must be one of the functions 
P or R described previously. If R is used, we add to the algorithm's input the 
space storage M allotted for views. 

In the first algorithm iteration, the values of the objective function are com- 
puted for each view within V. The view v max that maximizes F, if it exists 
(F/s(vmax) > 0), is then added to S. If R is used, the whole space storage M is 
decreased by the amount of space occupied by v max . 

The function values of F are then recomputed for each remaining view in 
V — S since they depend on the selected views present in S. This helps taking 
into account the interactions that probably exist between the views. 



We repeat these iterations until there is no improvement (F/ s (v) < 0) or until 
all views have been selected (V — S = 0). If function R is used, the algorithm 
also stops when storage space is full. 

7 Experiments 

In order to validate our approach for materialized view selection, we have run 
tests on a 1 GB data warehouse implemented within Oracle 9i, on a Pentium 
2.4 GHz PC with a 512 MB main memory and a 120 GB IDE disk. This data 
warehouse is composed of the fact table Sales and five dimensions Customers, 
Products, Times, Promotions and Channels. We executed on our data ware- 
house a workload composed of sixty-one decision-support queries involving ag- 
gregation operations and several joins between the fact table and dimension 
tables. Due to space constraints, the data warehouse schema and the detail of 
each workload query are available at http : //eric . univ-lyon2 . f r/~kaouiche/ 
adbis . pdf . Our experiments are based on an ad-hoc benchmark because, as far 
as we know, there is no standard benchmark for data warehouses. TPC-R [21] 
has no multidimensional schema and does not qualify, for instance. 

We first applied our selection strategy with the profit function. This function 
gives us the maximal number of materialized views (twelve views) because it 
does not specify any storage space constraint. This point gives us the upper 
boundary of the storage space occupation. Then, we applied the profit/space 
ratio and hybrid functions under a storage space constraint. We have measured 
query execution time with respect to the percentage of storage space allotted 
for materialized views. This percentage is computed from the upper boundary 
computed when applying the profit function. This helps varying storage space 
within a wider interval. 

Ratio profit/space function experiment. We plotted in Figure 4 the vari- 
ation of workload execution time with respect to the storage space allotted for 
materialized views. This figure shows that the selected views improve query exe- 
cution time. Moreover, execution time decreases when storage space occupation 
increases. This is predictable because we create more materialized views when 
storage space is large and thereby better improve execution time. We also observe 
that the maximal gain is equal to 94.86%. It is reached for a space occupation 
of 100% (no constraint on storage space). This case is also reached when using 
the profit function, because it corresponds to the upper boundary. 

Hybrid function experiment. We repeated the previous experiment with 
the hybrid objective function. We varied the value of parameter a between 0.1 
and 1 by 0.1 steps. The obtained results with a e [0.1,0.7] and a 6 [0.8,1] are 
respectively equal to those obtained with a — 0.1 and a = 0.8. Thus, we plotted 
in Figure 5 only the results obtained with a = 0.1 and a = 0.7. This figure shows 
that for percentage values of space storage under 18.6%, the hybrid function 




storage space (percentage) storage space (percentage) 



Fig. 4. Profit/space ratio function Fig. 5. Hybrid function 

with a = 0.1 and a — 0.8 behaves as the ratio function. When the storage space 
becomes critical, the hybrid function behaves as the ratio profit /space function. 
On the other hand, for the percentage values of storage space greater than 18.6%, 
the results obtained with a = 0.8 are slightly better than those obtained with 
a = 0.1. This is explained by the fact that for the high values of a, the hybrid 
function chooses the views providing the most profit and thereby improving the 
best the execution time. The maximal gain in execution time observed for the 
values 0.1 and 0.8 of a is equal to 96%. 

Selected view pertinence experiment. In order to show if our strategy 
provides pertinent views for a given workload, we measured the covering rate of 
the workload query results by the selected views. We mean by covering rate the 
ratio between the number of queries resolved from the materialized views and the 
total number of queries within the workload. Thus, the highest the rate value, 
the most pertinent the selected views. In this experiment, the percentage of 
storage space is also computed from the upper boundary. We plotted in Figure 6 
the covering rate according to storage space occupation. This figure shows that 
the covering rate increases with storage space. When storage space gets larger, 
we materialize more views and thereby we recover more query results from these 
views. When materializing all the views (100% storage space occupation), all 
the data corresponding to query results are recovered from the materialized 
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Fig. 6. Query covering rate by the selected materialized views 



views. This shows that, without storage space constraint, the selected views 
are pertinent. For example, for 0.05% storage space occupation, 22.95% of the 
query results are recovered from the selected views. This shows that, even for a 
limited storage space, our strategy helps building views that cover a maximum 
number of queries. This experiment shows that materialized view selection based 
on workload syntactical analysis is efficient to guarantee the exploitation of the 
selected views by the workload queries. 



8 Conclusion 



In this paper, we presented an automatic strategy for materialized view selection 
in data warehouses. This strategy exploits the results of clustering applied on 
a given workload to build a set of syntactically relevant candidate views. Our 
experimental results show that our strategy guarantees a substantial gain in 
performance. It also shows that the idea of using data mining techniques for 
data warehouse auto-administration is a promising approach. 



This work opens several future research axes. First, we are still currently 
experimenting in order to better evaluate system overhead in terms of material- 
ized view building and maintenance. The maintenance cost is currently derived 
from the query frequencies (Section 4). We are envisaging a more accurate cost 
model to estimate update costs. We also plan to compare our approach to other 
materialized view selection methods. Furthermore, it could be interesting to de- 
sign methods that select both indexes and materialized views, since these data 
structures are often used together. More precisely, we are currently developing 
methods to efficiently share the available storage space between indexes and 
views. Finally, our strategy is applied on a workload that is extracted from the 
system during a given period of time. We are thus performing static optimiza- 
tion. It would be interesting to make our strategy dynamic and incremental, as 
proposed in [14]. Studies dealing with dynamic or incremental clustering may be 
exploited. Entropy-based session detection could also be beneficial to determine 
the best moment to run view reselection. 
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